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Abstract 

Cloud computing has made it possible for a user to be able to select a computing service precisely 
when needed. However, certain factors such as security of data and regulatory issues will impact 
a user's choice of using such a service. A solution to these problems is the use of a hybrid cloud 
that combines a user's local computing capabilities (for mission- or organization-critical tasks) with 
a public cloud (for less influential tasks). We foresee three challenges that must be overcome before 
the adoption of a hybrid cloud approach: 1) data design: How to partition relations in a hybrid 
cloud? The solution to this problem must account for the sensitivity of attributes in a relation as 
well as the workload of a user; 2) data security: How to protect a user's data in a public cloud with 
encryption while enabling query processing over this encrypted data? and 3) query processing: How 
to execute queries efficiently over both, encrypted and unencrypted data? This paper addresses these 
challenges and incorporates their solutions into an add-on tool for a Hadoop and Hive based cloud 
computing infrastructure. 

1 Introduction 

The emergence of cloud computing has created a paradigm shift by allowing parallel processing of massive 
amounts of data. Cloud computing has further segmented traditionally provided software services into 
SaaS, PaaS and IaaS. This segmentation allows users to choose the appropriate kind of computing service 
precisely when needed. Further, using cloud computing services can significantly lower a user's capital 
expenditure since they only pay for services that they use. However, a user needs to make an informed 
decision as to whether or not to use cloud services based on other factors such as the level of information 
privacy desired, regulatory issues and local computing capacity. A user may be tempted to use other 
secure data processing alternatives such as full homomorphic encryption pQ . Unfortunately such methods 
are very expensive as the the data size increases. Given these issues, for certain users it may be a better 
choice to adopt a hybrid cloud (public and private) approach rather than relying solely on a cloud 
service provider. Further, this hybrid solution enables certain mission- or organization-critical tasks to 
be executed locally at a user's site while allowing less important tasks to be outsourced to the public 
cloud. Moreover, this increases throughput while reducing operational costs with a high-level of data 
security. 

There are a number of technological issues that need to be addressed before the adoption of a hybrid 
cloud methodology. The first issue is, how to distribute data in a hybrid cloud? This is the data design 
problem which focuses on how data should be partitioned and where these partitions should be placed. 
The main reasons for data distribution are scalability, higher concurrency and greater throughput. There 
are a number of related design issues such as granularity of partitions and application requirements. 
Data design, especially in the cloud computing paradigm is a challenging task. This is because certain 
attributes of a user's data may be sensitive, in which case the user cannot release this information to 
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a cloud service provider unless it is encrypted. Our data design module takes into account this factor 
during the process of data fragmentation and fragment allocation. 

The next issue is, how do users protect themselves from cloud service providers who may be able 
to access their data? This issue is related to data security and is relevant for users since their data is 
placed at the provider's site. The goal of data security is to prevent the service provider from learning 
any meaningful information from the data. Data security can be achieved by encrypting the stored 
data. However, encryption presents a new set of challenges such as granularity of encryption and query 
processing over encrypted data. Our security module addresses these challenges. 



, R 2 ,--- ,R, n } | 



User Interface Layer 



A 1: A 2 ,-,A n 



Qi.Qz,-,Q k 



I 



Data Design Layer 



AT — non-sensitive 



Encryption 



Query Processing Engine 
I 



A p " 







Mapping 





A k 6 Qi'\s sensit ve 



Query Result 



Public 



Table 1 



View 1 



Table n 
I — 
• l • 

~l — 



Structure 



View n 



Hive 
+ Hive QL 

_, HDFS 



File 1 



File n 



AK = sensitive 




non-sensitive 



Private 



Table 1 



Table n 



View 1 



Structure 



View n 



Hive 
-)- Hive QL 

, HDFS 



File 1 



• • » 



File n 



Figure 1: The hybrid cloud architecture 



The final issue is, how can users query the cloud infrastructure without being aware of the separation 
of data in a hybrid cloud? This is the problem of query processing in a distributed environment. The goal 
of a cloud query processor is to transform a high-level query into a low-level query that can be executed 
over a hybrid cloud. This query processor must be able to take into account the cost of executing queries 
over a hybrid cloud containing both, unencrypted and encrypted data. Further, this query processor 
must be able to optimize query execution given the fragmentation of relations as well as the sensitivity 
of attributes. Our query processing module solves these problems effectively. 

Figure [T] presents the architecture of our proposed system. A user submits the original set of relations, 
R = {R\,R2,--- , Rm) and input queries, Q — {Qi, Q%, . . . , Qk}- The data design layer vertically 
partitions the set of attributes A = {Ai,A2, . . . , A n } over all relations R into A pu and A pr which are 
the sets of attributes stored on the public and private clouds respectively. These sets are determined 
by solving an optimization problem that minimizes the cost of executing Q over A pu and A pr . Then, 
the sensitive attributes in A pu are encrypted and mapped before being stored while the non-sensitive 
attributes are stored unencrypted in A pu and A pr . The query processing engine takes a query qi G Q 
or an ad-hoc query, q x , and transforms it into a query over a hybrid cloud. The results obtained by 
executing queries on both clouds are then combined into the final result that is passed on to a user. 
We use a Hadoop HDFS and Hive based cloud storage infrastructure for our implementation. Hadoop 
HDFS is a distributed file system that is designed to run on commodity hardware. In our architecture, 
the HDFS layer stores files that contain the vertical partitions created in the data design phase. A file 
may contain unencrypted or encrypted data depending on the sensitivity of attributes stored in that file. 
Hive is a data warehouse that is built on Hadoop. Hive allows a user to define structure for files that are 
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stored in the underlying HDFS. Furthermore, Hive provides a user the ability to query this structured 
data using a SQL-like query language called Hive QL. In our architecture, Hive is used to create the 
public and private components of a relation using the vertical partitions stored in HDFS. Then, a query 
q is split into two sets of sub-queries using Hive QL: q pu is executed on the public cloud, while q pr is 
executed on the private cloud. The results of q pu and q pr are combined on the private cloud and then 
returned to a user. 

Our contributions: To address the challenges we identified earlier with hybrid clouds we present 
the following novel contributions in this paper: 

• Algorithmic approaches to the data partitioning problem for a hybrid cloud in which we consider 
the cost of encrypted data storage on a public cloud. 

• An efficient distributed query optimization and processing engine. Our engine takes into account 
the cost of querying over encrypted data. 

• Implementation of these functions as an add-on for a Hadoop and Hive based computing infras- 
tructure. 

The rest of the paper is organized as follows: Section [5] reviews related work in the area of secure 
distributed data processing. Section [3J presents details of our approach to data processing in a hybrid 
cloud. In section U we present results of experiments conducted on our implementation. Finally, we 
present our conclusions and future work in section [5l 

2 Related Work 

In this section we provide a brief overview of the relevant research areas that are related to our work in 
this paper. 

A lot of research has focused on data partitioning and distributed query processing without explicitly 
considering the cost of data security. In our work, we categorically include this cost in both these areas. 

Data Partitioning: A lot of research has focused on the problem of data partitioning in single [5] 
and distributed systems [3] using a strategy such as that given in 0]. Reference [5] uses a graph-based, 
data-driven partitioning approach for transactional workloads. Our work explicitly considers the cost of 
querying encrypted attributes that will be stored on the public cloud as a result of the data partitioning 
process. 

Distributed Query Processing: Research efforts have also been made in the area of distributed 
query processing in the cloud as given in [5] . Distributed query processing has evolved from systems such 
as SDD-1 [7] that assumed homogeneous databases to DISCO [5] that operated on heterogeneous data 
sources and finally to Internet scale systems such as Astrolabe [3] . Since we need to execute queries over 
partitions containing unencrypted and encrypted data, we may not be able to process a query entirely 
on a public or private cloud. This leads to a cost model that is different from models that currently exist 
in literature. 

Privacy: The area of privacy-preserving query processing has also received much attention [HUE] ■ A 
homomorphic encryption based technique can be used to query over encrypted data |12] but is expensive 
when the data size increases. We use techniques given in to preserve security of data. However, the 
difference between our work and is that we can store and query data locally unlike [11] , 

We have also identified a recent work, called Relational Cloud [TO], that attempts to address the 
problems we have identified above. The difference between our work and Relational Cloud is that our 
data partitioning scheme considers the cost of querying encrypted attributes stored on a public cloud. 
Relational Cloud partitions data using a graph-based partitioning scheme without attaching any query 
cost constraints. These partitions are then encrypted with multiple layers of encryption and stored on 
a server. A query is then executed over the encrypted data with multiple rounds of communication 
between a client and server without considering the cost of decrypting intermediate relations. In our 
work, we explicitly consider the cost of queries that involve all three components of a hybrid cloud: a 
query over data in a private cloud, a query over non-sensitive (i.e., unencrypted) data and, a query over 
sensitive (i.e., encrypted) data on a public cloud. To the best of our knowledge, ours is the first work to 
explicitly estimate the cost of querying over unencrypted and encrypted data in a distributed setting. 
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3 Secure Data Processing 



This section focuses on the security layer, the query processing engine and the data design layer in that 
order. This is done since each successive layer is dependent on the concepts presented in the earlier layer. 

3.1 Data Security 

The data security layer prevents the cloud service provider from being able to gain any meaningful 
information from the sensitive data. The main challenge in this layer is the efficient execution of queries 
over encrypted data stored on a public cloud. We use the techniques given in to solve this problem. 
However, we can store and query data in a private cloud which was not possible in We provide a 
brief overview of certain important concepts from and refer the reader to [llj for a more detailed 
explanation. The query processing engine makes use of these techniques to perform query rewriting that 
allows an input query to be split into multiple sub-queries over a hybrid cloud. 

3.1.1 Sensitive Attribute Encryption and Storage 

The given set of relations, R±, R2, . . . , Rm is vertically partitioned into a partition stored on a user's 
private cloud and a partition stored on a public cloud. The public cloud partition is further fragmented 
into a fragment containing sensitive data and a fragment containing non-sensitive data. We now explain 
how the fragment containing sensitive data is encrypted and stored on a cloud service provider such that 
queries can be run directly over the encrypted data. 

The domain (Di) of a sensitive attribute R pu .Aj is divided into z partitions, p\,p2, . . . ,p z such that all 
partitions taken together cover T>i and no two partitions overlap one another. An identification function, 
ident, assigns an identifier, ident n P u Aj (p z ) to each partition of R pu .Aj such that identRp^.A^Py) ^ 
identRpu .AjiPz) if V ^ z. A mapping function, Map, is used to map a value v in T>i of attribute 
R pu .Aj to the identifier of the partition to which v belongs: MapRpv Aj {v) = ident^p^ ,Aj{Pz)- A 
value v of a sensitive attribute R pu .Aj in a vertical partition of a relation, R pu is then encrypted as 
E(v) — {encrypt{v), MapRpv .Aj{v)) . We use AES [13 in CTR mode [2] as the encryption function, E, 
while MapRpu .aAv) acts as an index on the attribute R pu .Aj. The corresponding decryption function, 
D, then decrypts E(v) to return the original value, t, after dropping the identifier that is stored along 
with the encrypted value. 

3.1.2 Mapping query conditions 

When a query is to be evaluated on a public cloud, the query conditions need to be mapped to conditions 
over the encrypted data stored on this public cloud. For example, in a selection operation with an equality 
condition, A; = v, the value, v, is mapped to the identifier of the partition that contains v as MapAi(v). 
Similar mapping conditions exist for other query conditions as shown in [llj . 

3.1.3 Relational operators over Partitioned Relations 

For a query to execute over partitioned relations that contain sensitive attributes, the underlying rela- 
tional algebra operators need to be modified to be able to use the functions ident and Map. For example 
if the selection condition contains a sensitive attribute, a partial result can be computed on the public 
cloud using the index for sensitive attributes, Ri n t = a Map This result can then be passed back 

to the query processing engine where it is filtered after being decrypted for tuples that do not match the 
selection condition, ac{R) = &c{D{Rint))- The remaining relational algebra operators are implemented 
inH3J. 

3.2 Query Processing Engine 

This section describes our query optimization and processing engine. We first present our distributed 
query execution cost model and then we give a query processing algorithm that is used to query a hybrid 
cloud. 
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3.2.1 Distributed Cost Model 

Algorithm [TJ is used to estimate the execution cost of Q queries using statistics for all relations. The 
execution cost of a query qi in a hybrid cloud setting can be computed as: 

Cj = freq(qi) x {max(Tf u ,Tf r ) + T co ), (1) 

where Tf u and Tf r are the local processing times on a public and private cloud respectively, and T co is 
the time to combine the intermediate results at the private cloud. The access frequency of a query qi is 
given by freq(qi). 



Algorithm 1 QPC() 

Input: Q, SR Output: Query execution cost, c 

1: C<-0 

2: for i <— 1 to Q.length do 

3: Cj <- 

4: Divide into gf" and <jf r using transformation rules 

5: for j <— 1 to qf u . length do 

6: if Ak G in is sensitive then 

7: MapA k (v) = ident Ak (pk) where v G 2\- of A fc 

8: end if 

9: Cj Cj + ?«2 x \Rf u \ + w 3 x |-Rf^p| {Compute public cloud cost} 

10: end for 

11: for j •<— 1 to qf r .length do 

12: c 2 ; ■<— Ci + tui x LR(m p {Compute private cloud cost} 

13: end for 

14: Cj (- q + W4 x l-Rfmp + Rtmp\ {Combination cost} 

15: c <- c + freq(qi) x Cj 

16: end for 

17: return c 



Early distributed cost models only considered minimizing the communication cost |15j . However, we 
believe that Equation[T]is a generalized way to estimate the query execution cost since the communication 
cost has only improved over time [TB] . Further, to the best of our knowledge, our cost model is the first to 
estimate the cost of query execution over unencrypted and encrypted data. The private cloud processing 
time is estimated as: Tf r = Wi X |-R$^p| (hue 12). is the size of the intermediate relation at the 

private cloud and weight w\ is estimated based on the private cloud infrastructure. The public cloud 
processing time is evaluated as: Tf w — W2 x \Rf u \ +1113 x (line 9). Rf u is the size of the relations 

over which q^ u will be executed and | R^ p | is the size of the intermediate relation created as a result of 
executing qf u . Weight W2 is estimated from the public cloud infrastructure while W3 is estimated from the 
network used to transfer data between the public and private clouds. Finally, T co = W4 x + Rtmp\ 

represents the time to combine intermediate results at the private cloud (line 14). Weight W4 is estimated 
based on the private cloud's capability to combine public cloud results with local results. W4 also captures 
the time to decrypt relations and filter unwanted tuples. |i?t mp | is estimated based on the query operator 
type. Our current work only supports simple SQL queries while we leave the support of nested queries 
as future work. The running time of this algorithm is 0(k) where k = X)"=i if" -length + qf r .length and 
n = Q.length. 

3.2.2 Distributed Query Processing 

Algorithm [2] presents details of our query processing engine. This algorithm is used to execute a query 
over a hybrid cloud. Algorithm [5] consists of four phases, each of which we motivate with the query given 
in Figure [5] that is a modified version of Q3 of TPC-H {T7\ and is given as follows: 

SELECT li_ok, sum(li_epr* (l-li_dis) ) , ord_od, ord_spr 
FROM cust, ord, li 

WHERE cust_mks=BU AND cust_ck = ord_ck AND li_ok = ord_ok AND ord_od < 1995-03-15 AND li_sd > 1995-03-15 
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Algorithm 2 QPEQ 



Input: qi Output: Query result, R res 
l: Divide into qf u and qf r using transformation rules 
2: Execute qf u and gf r in parallel 
3: for j «— 1 to qf u . length do 
4: if £ -R ptl in Qj is sensitive then 
5: MapA k {v) — identA k (Pk) where v G of Afc 
6: end if 

7: Execute ^ over {Public cloud execution} 

8: end for 

9: for j <— 1 to qf r .length do 

10: fltmp ^~ Execute gfj on i? pr {Private cloud execution} 
11: end for 

12: R res «— Combine R^ p and R V i mp {Result combination} 
13: return R res 
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We assume that the lineitem and order relations are sensitive and hence are encrypted on a public cloud 
while the customer relation is stored on a private cloud. Figure 2(a) shows an execution plan for the 
query which is transformed using the different phases of Algorithm [5] as given below: 
Query Rearrangement: To divide a query q into q pu and q pr we use relational algebra transformation 
rules. This phase will transform the query plan from Figure 2(a) to the plan given in Figure [2 (b) | using 
the commutativity rule of the join operation for our example. 

Public Cloud Execution: In this phase a generated sub-query(ies) is(are) executed over relations in 
a public cloud. Figure 2(c) shows how the query plan is divided into a public cloud query, q pu , and 
a private cloud query, q pr . We push as much processing to a public cloud as possible by mapping the 
original query conditions to conditions over encrypted attributes using the Map function. The box on 
the left hand side of Figure 2(c) represents q pu for our example. 

Private Cloud Execution: A sub-query(ies) is(are) directly executed over relations in a private cloud. 
The box on the right in Figure 2(c) represents the private cloud query, q pr , for our example. This query 
is executed in parallel with q pu on the public cloud. 

Post-processing: This phase combines the intermediate relations generated at the public and private 
clouds into the final result. Figure [2(d)] shows the post-processing step (q pr ) for our example query. This 
step decrypts the data that is received from the public cloud. Next, incorrect results are filtered from 
the decrypted data by applying the original query conditions. Finally, the results from the public and 
private clouds are combined and returned to a user. 



3.3 Data Design Layer 

The data design layer is concerned with partitioning a set of relations between a user's private cloud and 
a public cloud service provider. The process of partitioning is necessary since a user's private cloud may 
not have sufficient storage and/or processing power. This process becomes more complex in our setting 
since we want to protect the privacy of a user's data. We first define the data design problem and then 
present algorithmic strategies to solve this problem. 



3.3.1 Data Partitioning Problem 

The data design problem in a hybrid cloud setting is defined as follows: Minimize the execution cost 
of a set of queries, Q, over a distribution of attributes, A, among the public (A pu ) and private (A pr ) 
clouds. This problem is subject to the condition that A pr .size < W, where W is the disk space available 
on the private cloud. This is clearly an optimization problem, which we call the CLOUD-SUBSET- 
SELECTION (CSS) problem. There are an exponential number of subsets of A, each of which needs to 
be tested as a solution to the CSS problem. We can verify in polynomial time that the execution cost of 
Q queries is less than a bound C for a given A pu and A pr . Hence, the CSS problem belongs to the class 
of NP problems. Moreover, the 0-1 Knapsack Problem (0-1 KP) can be reduced to the CSS problem 
making the CSS problem NP-complete. A formal proof of NP-completeness is given in Appendix [AJ 



3.3.2 Algorithmic Solutions to Data Partitioning 

We use two different algorithmic strategies to produce a close to optimal solution for the CSS problem 
based on dynamic programming and hill climbing. The idea of using a dynamic programming solution to 
solve the CSS problem comes from the similarity between the CSS problem and 0-1 KP. Further, the hill 
climbing technique is inspired from the SDD-1 algorithm [7], which is a well known query optimization 
algorithm for distributed databases. 

Algorithm [3] (CSS-DP) is derived from a dynamic programming solution to 0-1 KP. However, there 
are several differences between the two problems. Firstly, 0-1 KP considers items with a weight Wi and 
a profit pi, while the CSS problem considers attributes with a size Assize and an associated cost Cj. 
0-1 KP tries to maximize the profit P while maintaining the weight of the knapsack less than W. The 
CSS problem tries to minimize the execution cost of Q queries given that A4 is placed in the private 
cloud under the constraint that A pr .size < W. The algorithm takes the following parameters as input: 
a set of attributes A, the size of the private cloud W, a set of input queries Q and the statistics for all 
relations as a set, Vi?i € R; stat(Ri) G SR. Note that SR captures the partitioning of A into A pu and 
A pr . Algorithm [3] begins by calling Algorithm [T] to compute an initial cost (cj n ) for Q queries given that 



7 



Algorithm 3 CSS-DPQ 



Input: A, W, Q, SR Output: An array p 

l: Update SR such that A pu = A, A pr = 

2: c in 4- QPC(Q, SR) {Compute initial cost} 

3: for i = to W do 

4: p[0,£]«-0 

5: end for 

6: for i 4— 1 to A.length do 



7: p[i,0]«-0 

8: for j = 1 to do 

9: if Ai.size < j then 

10: Update SR such that A 4 moved from A pu to A pr 

11: Ci 4- Cin- QPC(g, ST?) {Compute the proht for A z such that A pu = A pu - A+, A pr = A,} 

12: if Ci +p[i — 1, j — Ai.size] > p[i — 1, j] then 

13: j] 4- p t +p[i - - Ai.size] 

14: else 

15: p[i,j] <-p[i- 

16: end if 

17: else 

18: -«-p[i- 

19: end if 

20: end for 



21: end for 
22: return p 



A pu — A and A pr = (line 2). Then, the profit associated with each Ai can be computed as: Cj = Cj„— 
QPC(Q, using an updated SR such that is moved from A pu to A pr (line 11). Algorithm [3] then 
finds the maximum profit that can be achieved over A using the profit for each Ai € A. The term "profit" 
is used under the assumption that the private cloud is able to process queries faster than the public cloud. 
However, if the converse is true, there could be a loss in execution cost. Then, the algorithm may keep 
most of the attributes in the public cloud. After execution, Algorithm [3] returns an array of size n x W, 
where n = A.length, that contains the maximum profit that can be achieved in position [n, W]. The set 
A pr can be computed by starting at p[n, W] and tracing backwards based on the profit earned and size 
associated with every attribute. Note that the running time of this algorithm 0(nW x k), where k is 
the running time of Algorithm [TJ and the time to compute the set A pr is 0(n). 



Algorithm 4 CSS-HC() 
Input: A, W, Q, SR, bound Output: A pu 
l: A pr <- A 

2: Create A pu using selected strategy. {Initial solution} 

3: Update SR based on A pu and A pr 

4: a n QPC(Q, SR) {Compute initial cost} 

5: Cp rev 4 Ci n , C new 4 Ci n -f- 1 

6: while c new > Cprev \ \ iter < bound do 

8: Randomly swap a pair of attributes from A pu and A pr to get A^ 1 and A p n T such that A p n r < W 

9: Update SR based on A pu and A p n r 
10: c new i— QPC(Q, SR) {Compute new cost} 

11: end while 

12: A pu 4- A p n u 

13: return A pu 



Algorithm 0] (CSS-HC) uses a hill climbing technique and takes the same input parameters as CSS- 
DP. It also takes a bound on the number of random swaps to perform between A pu and A pr . An initial 
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solution is built (line 2) using one of the following greedy strategies: 1) Keep as many attributes from 
the query set in the private cloud as possible (CSS-HC-Qucry). 2) Keep as many sensitive attributes in 
the private cloud as possible (CSS-HC-sensitivity). An initial execution cost (cj n ) of Q queries is then 
computed using Algorithm [1] (line 4) . A pair of attributes from A pu and A pr is then randomly swapped 
(line 8) and the execution cost is recomputed (line 10). If this cost is better than the previous cost the 
new partitioning is retained. If the converse is true, the process of swapping attributes and recomputing 
execution cost is repeated. The total running time of the algorithm is 0(bound x k) where k represents 
the running time of Algorithm [1] 



4 Experimental Results 

This section presents the results of experiments conducted to compare the performance of the previous 
two algorithms. We first present details of our experimental setup followed by the set of experiments. 

Experimental Setup: Our experiments were conducted on two local clusters that are on different 
sub-networks of the same university intranet. We consider that this configuration simulates a real-world 
hybrid cloud well. This is because the average transfer speed between the nodes of our two local clusters 
(sa 672.04KB/sec) is the same as the average transfer speed between a node in our local network with 
Amazon S£| (« 683.05KB/sec for encrypted data and » 710.46KB/sec for unencrypted data) [18]. The 
first cluster consists of 4 nodes each with a Pentium IV processor with a 250 GB hard drive and 1GB 
of main memory and is used as the private cloud. The second cluster consists of 20 nodes each with a 
Pentium IV processor with 290GB to 360GB disk space and 4GB main memory and is used as the public 
cloud. Both clusters are setup using HadoopH vO.20.2 and Hivc| v0.6.0. The first cluster is configured 
with » 350GB disk space while the second is configured with w 4.7TB of disk space for HDFS. 

Security Functions: We used SHA-256 [19] as the ident function. Further, we used the built-in 
datatypes, int, double and string in Hive to represent the attributes of all relations from the TPC-H 
benchmark. The number of partitions, P, for a datatype could be varied from 1 to the number of unique 
values in the domain of an attribute. When P = 1 we get a high degree of security, however, query 
processing time increases since all values are mapped to the same partition. When P = the number of 
unique values in the domain of an attribute, query processing is fast since a small subset of values is 
mapped to a partition. However, the level of security is reduced since for example, a public cloud service 
provider could learn the data access patterns of queries. We used the following equation to determine P: 

log(max — min) 



where min and max represent the minimum and maximum values for the datatype as mandated by 
the TPC-H benchmark. For an integer datatype, min = —2,147,483,646 and max = 2,147,483,647. 
This leads to n = 31 partitions using the above equation. Similarly, for a double datatype, min = 
-9, 999, 999, 999.99 and max = 9, 999, 999, 999.99, leading to n = 34 partitions. For the string datatype, 
we created 36 partitions as a — z and — 9. Unless specified otherwise, we use these partitions in our 
experiments. To encrypt subsets of attributes we used the AES [13] in CTR mode [14] from the Java 
cryptographic extension. 

Queries: We have used the TPC-H benchmark [T7] with a scale factor 300 (ss 323GB) in our 
experiments. We did not run experiments for larger databases since we think the current case gives us 
sufficient insight into the workings of the algorithmic strategies. The first experiment used Q10 of TPC-H 
without the grouping and aggregate operations. The next two experiments used a query workload of 
100 queries containing modified versions of TPC-H queries Ql, Q3, Q6 and Q10. In particular, we do 
not perform grouping and aggregate operations in any query. Further, freq(qi) was randomly selected 
between 1 and 1000. Additional details of the workload preparation are given in Appendix IB.21 

Preliminary Experiments: We ran a set of preliminary experiments to estimate the weights defined 
in our cost model. These experiments were run only once and generate weights that are effective as will 
be shown. The values generated were: w% = 0.000545146, w 2 = 0.000072686, w 3 = 0.000001488 and 
u>4 = 0.0000041. Details of these preliminary experiments are provided in Appendix IB. II 

1 http:/ /aws. amazon.com/s3/ 
2 http:/ /hadoop. apache. org/ 
3 http: / /hi ve.apache.org/ 
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Figure 3: Comparison of CSS-DP and CSS-HC when the number of partitions, P, is varied 



Experiment with changing the number of partitions (P): The aim of this experiment was to 
compare the algorithms when P is increased while the private cloud size (« 90GB) and the sensitivity 
of attributes (~ 50%) are fixecjj]. Further, we wanted to show that although wi > W2, the choice of P 
affects the query performance on the public cloud. The solid and dashed lines in Figure [3] represent the 
running times when A pu — A (All-Public) and A pr — A (All-Private) respectively. The running time 
is constant for All-Private as it does not use partitions for query processing. For All-Public, the time 
decreases as P is increased. When P is small, the query takes longer to perform a join on the public 
cloud, since a large number of values map to the same partition. More time is also spent in transferring 
data to, and decrypting data on, the private cloud. As P increases, the time needed to perform a join 
as well as the transfer and decryption time reduces. Both the CSS-HC techniques perform similar to 
All-Private. This is because they always leave attributes from the query set or sensitive attributes, in 
the private cloud. When P is small, CSS-DP picks attributes such that Aj £ A pr when Aj £ Q. The 
distribution of data for CSS-DP in such a case, for example for P — 4 is: about 76% data in the public 
cloud and 24% data in the private cloud. This leads to a running time that is better than All-Public 
and All-Private. However, as P increases more attributes from the query set are pushed to the public 
cloud. This is because the time taken to perform a join on the public cloud followed by decrypting and 
filtering intermediate results is much lesser than performing the query on the private cloud. 



-All-Public 
All Private 
CSS-DP 

■ CSS-HC-query 

■ CSS-HC-sensitivity 



2 20000- 



10% 30% 50% 80% 100% 

% of total dataset size in private cloud (W) 



Figure 4: Comparison of CSS-DP and CSS-HC when the private cloud size, W, is scaled 
Experiment with changing the private cloud size (W): The goal of this experiment is to 

4 Our running times are very similar to the timings in I20I even though we use a less powerful cluster than I20| , 



10 



compare the performance of our algorithms when W increases while the sensitivity of attributes is fixed 
randomly (at w 40% over all relations) and the default partitions of 31, 34 and 36 are used. The dashed 
line in Figure S] represents the running time for All-Private. Further, All-Public and CSS-DP overlap 
in Figure |4] This is expected since from Figure [3] we see that the running times for these two cases are 
similar for P > 15. CSS-DP performs much better than CSS-HC. When an attribute Ai G Q is moved 
from A pu to A pr , it would result in a loss in execution cost. Therefore, CSS-DP picks attributes such 
that Aj G A pr when Aj ^ Q. This is expected, since w± > W2, however, this is not a general trend as was 
shown by the previous experiment. On the other hand, both CSS-HC techniques start with an initial 
solution that is iteratively improved. However, since w\ > W2, the initial estimate of execution cost is 
already much higher than the CSS-DP case. Hence, a random swap of any Ai G A between A pu and A pr 
does not change the execution cost significantly. For the W — 10% case, the CSS-HC techniques perform 
better than for the other W cases. The CSS-HC techniques store most of the attributes needed by the 
query set on the public cloud and hence query processing is much faster. As W scales, more attributes 
from the query set are brought into the private cloud. Therefore, the processing time becomes as slow 
as the case when A pr = A. 
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Figure 5: Comparison of CSS-DP and CSS-HC when attribute sensitivities, S, are increased 

Experiment with changing sensitivities (S): This experiment measures the time to run the 
workload when W is fixed (at « 150GB) while S is varied from some attributes being sensitive to all 
being sensitive. Figure [5] shows a comparison of the algorithmic strategies and we see results that are 
similar to Figure [4j CSS-DP selects attributes such that Aj G A pr when Aj ^ Q; this makes the query 
execution time for the workload much faster. The CSS-HC strategies select an initial solution based on 
the selected criteria (query or sensitivity). Again, since w\ > W2, neither of these strategies produces 
a good workload execution time when compared with CSS-DP. For the S = 30% case the CSS-HC- 
sensitivity technique produces a query workload time that is better than the other S cases. This is 
because the initial solution and subsequent random swaps leave most of the query set attributes in the 
public cloud. 

General Observations: We observe that for some cases of CSS-HC the execution time is higher 
than when A pr = A. This is due to the additional time for the post-processing step. We also observe that 
the weights estimated by the preliminary experiments perform well. The weights capture the purpose 
for which they were measured as shown by the weights for private (wi) and public (W2) cloud processing. 

5 Conclusions and Future Work 

With the advent of cloud computing, a hybrid cloud may be suitable for users who wish to balance data 
security with scalable data processing. We have identified three challenges that must be overcome before 
this approach can be adopted. 

The first challenge deals with data partitioning between a private cloud and a service provider when 
there are sensitive attributes in the data. We have developed two algorithmic approaches that produce 
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a close to optimal solution to this optimization problem. The second challenge is how to store a user's 
data securely on a cloud service provider? We have used existing encryption techniques to store a user's 
sensitive data on the cloud service provider. Moreover, this technique allows us to push most of the query 
processing work to the cloud service provider without the need of decrypting the stored data. Finally, 
the last challenge addresses the problem of distributed query processing over unencrypted and encrypted 
data. We have developed a cost model that estimates the cost of query execution over unencrypted and 
encrypted data. We have also presented a query processing engine that splits a user query into a public 
and private cloud query (ies). Each of these query (ies) is(are) then executed at each site using the best 
available local query plan. 

We are exploring the following areas for future research: 1) We have only considered a vertical 
partitioning of relations in this paper which will be extended to include horizontal and hybrid partitioning 
schemes. 2) Our cost model considers only simple SQL queries. We plan to build a more sophisticated 
model with support for nested queries. 3) In this paper we used Hadoop and Hive as the underlying 
cloud computing technologies. We aim to extend this work with more experiments into a generalized 
tool that will work with other existing public cloud services. 
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A The CSS Problem is NP-complete 



Given: A set of relations, R — {i?i, i?2, • • • , Rm}, a set of attributes, A — {Ai,A 2 , . . . , A n } over all rela- 
tions R where an attribute Aj 6 {sensitive, non-sensitive}, a set of input queries, Q = {Qi, Q2, ■ ■ ■ , Qk}- 

Problem: We have the following optimization problem for CLOUD-SUBSET-SELECTION (CSS), 

k 

minimize freq(qj) x QPC gi (X) 



i=i 



subject to ^2s(Aj) x Xj < PRIVATE_CLOUD_SIZE 

i=i 

where x — { ^ ^ A; * s m t ne P r i va te cloud; 

J \ if Aj is in the public cloud, 

and s (Aj) denotes the size of attribute Aj. 

We convert the optimization problem to a decision problem: 

Problem: Is there a partitioning of A into A pu and A pr such that the cost of executing Q queries over 
X = {A pu ,A pr } is at most C, where Aj G A pu if the corresponding x,- = and Aj G A pr if the 
corresponding Xj = 1? 

Lemma A.l. The CSS problem, belongs to the class NP. 

Proof. We shall provide a two-input, polynomial-time algorithm Al that can verify CSS. One of the 
inputs to the algorithm Al is a set of queries Q while the other input is a certificate corresponding to a 
partitioning of the attribute set A into A pu and A pr . 

Algorithm Al is constructed as follows: For each query qi 6 Q, Al determines the cost of executing 
qi given the partitions A pu and A pr , i.e., QPC qi (X). We assume that QPC qi (X) can be computed in 
polynomial time. If the sum of execution costs of all queries Q is less than the bound C, the algorithm 
outputs 1, since the partitioning of A into A pu and A pr provides a cost less than or equal to the bound 
C. Otherwise, Al outputs 0. 

Whenever a partitioning of A into A pu and A pr that produces an execution cost over all queries Q 
that is less than or equal to C is input to algorithm Al, there is a certificate whose length is polynomial 
in the size of A and that causes Al to output a 1 . Whenever a partitioning of A into A pu and A pr that 
produces an execution cost greater than C is input, algorithm Al outputs a 0. Algorithm Al runs in 
polynomial time. Thus, CSS can be verified in polynomial time, and CSS G NP. □ 

Lemma A. 2. The CSS problem is NP-hard. 

Proof. To prove that CSS is NP-hard we show that the 0-1 Knapsack Problem (KP) < P CSS. We show 
that any instance of 0-1 KP can be reduced in polynomial time to an instance of the CSS problem. 

We first define 0-1 KP as follows: Given a set of n items and a knapsack, with pj = profit of item j, 
Wj = weight of item j, c — capacity of the knapsack, select a subset of items so as to 



maximize z = ^^PjXj 



3 = 1 



subject to ^^WjXj < c, 

1 if item j is selected; 



where £,• = •»_ 

J 10 otherwise 

and j G N = {1,2, ...,n}. 
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We convert this problem into the following minimization problem subject to the same conditions as 
before, 

n 

minimize z = ~Pj x j 

0-1 KP can then be recast as the following decision problem: Can we achieve a profit of at most P 
without exceeding the weight c? 

The reduction algorithm begins with an instance of 0-1 KP. Let X = {x\, X2, ■ ■ ■ , x n } be the set of items 
each of which is associated with a profit pj and weight Wj, where j e N = {1, 2, . . . , n}. Also, let c be the 
capacity of the knapsack. We will construct an instance of the CSS problem with a set of attributes A 
over all relations R from the set of n items such that the 0-1 KP instance is satisfiable if and only if the 
CSS instance is satisfiable. Satisfiability in these problems means that the decision problem is answered 
with a 'y es '- The instance of the CSS problem is constructed as follows: 

For every item Xj £ X, the instance of the CSS problem has an attribute Aj. Further, the weight 
wj of an item Xj corresponds to the size of the attribute Aj, i.e., Wj = s(Aj). An initial cost, Ci„ is 
computed such that A pu = A and A pr — 0. Then, the profit of item Xj corresponds to the execution 
cost of Q queries (denoted as Cj) over a partitioning of A into X as given in the minimization problem of 
CSS such that Aj £ A pr . The profit can be computed as: Cj — Ci n — f re( l{ ( li) x QPC qi {X). This 

means that PjXj — CjXj where Xj = 1 for Aj in the CSS problem. Then, the total profit P becomes the 
execution cost, C, over all queries Q. Also the total capacity of the knapsack, c, becomes the size of the 
private cloud, PPJVATE_CLOUD_SIZE, which is computed as Y^j=i s (Aj) x Xj; \/Xj — 1. This instance 
of CSS can easily be computed from the instance of 0-1 KP in polynomial time. 

We now show that this transformation is a reduction under the assumption that QPC qi (X) can be 
computed in polynomial time for any query qi £ Q. First, suppose that the given instance of 0-1 KP is 
satisfiable. Then, we have a subset X C X such that the total profit P = Y^=iPj^ x £ A < P. We 
claim that X corresponds to A pr . Any element Aj will only be added to A pr when both conditions: 
Cj < C and £" =1 s(Aj) x Xj < PRIVATE.CLOUD.SIZE, hold. 

Conversely, suppose that the CSS problem instance has a partitioning of A into A pu and A pr that 
satisfies all constraints. Every Xj corresponding to an Aj £ A pr can be selected from the instance of 0-1 
KP to form the set of elements that will achieve at most profit P. This is because, each Xj will achieve 
at most profit pj < P and weight Wj < c. □ 

Theorem A. 3. The CSS problem is NP-complete. 

Proof. Immediate from Lemmas IA.1I and IA.2I and the definition of NP. □ 

B Additional Details of the Experiments 

The TPC-H benchmark is a decision support benchmark that consists of a schema that is typical of 
any business organization [T7]. The TPC-H benchmark provides a system that inspects large amounts 
of data by executing queries with a high degree of complexity that are derived from critical business 
questions. The TPC-H benchmark consists of 8 relations and 22 queries having a realistic context that 
capture the business activities of a wholesale supplier |17| . 

B.l Preliminary Experiments 

A set of preliminary experiments was run to estimate the weights wi to W4 that are used in out cost model. 
Weights wi and W2 represent the local processing times on the private and public clouds respectively 
and are estimated by running the same set of 4 queries on both the clouds. These queries consist of Ql, 
Q5 and Q13 from TPC-H [T7] while the 4 th query is as follows: 

select * from lineitem 1 join orders o on 1 . l_orderkey = o.o_orderkey 
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The four queries were selected to have a mix of low and high selectivity. From the average running times 
of each query we determined w\ as w\ = j± + 1| + 1| + |j. t\ represents the time to run Ql on the private 
cloud and bi represents the number of bytes generated in the result of an execution of Ql. The weight 
W2 is computed in the same way as w\. Using this procedure we have estimated, w\ = 0.000545146 and 
w 2 = 0.000072686. Since w\ > u>2, the private cloud processing is slower than public cloud processing 
for our hybrid cloud. However, this is not a general rule and the converse may also be true. Our cost 
model and partitioning algorithms capture either of these behaviors. 

Weight w 3 denotes the time required to transfer R^p to the private cloud and is estimated using the 
following query: 

select * from lineitem limit x, 

where we vary x from 10% to 100% of the number of tuples in the lineitem relation. This query was 
selected since the lineitem relation is the largest of all the TPC-H relations and W3 can be best estimated 
when a large amount of data is transferred. We then estimated W3 = 0.000001488 by averaging the 
running time of the previous 10 queries. 

The weight w 4 is used to capture the time taken at the private cloud to combine the intermediate 
results, Rtmp an( i -^tmp> obtained from the public and private clouds respectively. The same query used 
to estimate W3 was also used to estimate 104. However, the lineitem relation was partitioned between the 
public and private clouds in the following way: 

Private cloud: l_orderkey l_partkey l_quantity l_linestatus l_shipdate l_shipinstruct 
Public cloud sensitive: l_suppkey l_linenumber l_extendedprice l_commitdate l_shipmode 
Public cloud non-sensitive: l_disocunt l_tax l_returnflag l_receiptdate l_comment 

The weight W4 = 0.0000041 was then estimated by averaging the time to combine the results obtained 
from the different partitions for the given 10 queries. 

B.2 Query Workload Preparation 

We prepared a TPC-H database of scale factor 300 (« 323GB) using the TPC-H dbgen tool. We also 
created a query workload of 100 queries using 4 TPC-H queries (Ql, Q3, Q6 and Q10). For each of these 
queries the predicates in the query are randomly modified to vary the range of the data that is accessed. 
We summarize the ranges used for each predicate below: 

1992-01-01 <= l_shipdate <= 1998-12-31 

c_mkt segment = {AUTOMOBILE, BUILDING, FURNITURE, MACHINERY, HOUSEHOLD} 
1992-01-01 <= o_orderdate <= 1998-12-31 
0.00 <= l_discount <= 0.10 
1 <= l_quantity <= 50 
l_returnflag = {R, A, N} 
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